ABSTRACT

This paper presents a novel approach for enhancing the multiple sets
of acoustic patterns automatically discovered from a given corpus.
In a previous work it was proposed that different HMM configurations
(number of states per model, number of distinct models) for the
acoustic patterns form a two-dimensional space. Multiple sets of
acoustic patterns automatically discovered with HMM configurations
properly located at different points over this two-dimensional space
were shown to be complementary to one another, jointly capturing the
characteristics of the given corpus. By representing the given corpus
as sequences of acoustic patterns on different HMM sets, the pattern
indices in these sequences can be relabeled considering the context
consistency across the different sequences. Good improvements were
observed in preliminary spoken term detection (STD) experiments
performed on both TIMIT and Mandarin Broadcast News with such
enhanced patterns.

Index Terms: zero-resourced speech recognition, unsupervised
learning, acoustic patterns, hidden Markov models, spoken term
detection

1. INTRODUCTION 

Supervised training of HMMs for large vocabulary continuous speech
recognition (LVCSR) relies not only on collecting huge quantities
of acoustic data, but also on obtaining the corresponding
transcriptions. Such supervised training methods yield adequate
performance in most circumstances but at high cost, and in many
situations such annotated data sets are simply not available. This is
why substantial effort has been made for unsupervised discovery of
acoustic patterns from huge quantities of acoustic data without
annotation, which may be easily obtained nowadays. For some
applications such as Spoken Term Detection (STD), in which the goal
is simply to match and find some signal segments, the extra effort of
building an LVCSR system using corpora with human annotations is very
often an unnecessary burden [13][14][15][16][17]. Most effort on
unsupervised discovery of acoustic patterns considered only one level
of phoneme-like acoustic patterns. However, it is well known that
speech signals have multi-level structures including at least
phonemes and words, and such structures are very helpful in analysing
or decoding speech [12]. In a previous work, we proposed to discover
the hierarchical structure of two-level acoustic patterns, including
subword-like and word-like patterns. A similar two-level framework
was also developed recently. In a more recent attempt, we further
proposed a framework of discovering multi-level acoustic patterns
with varying model granularity. The different pattern HMM
configurations (number of states per model, number of distinct
models) form a two-dimensional model granularity space. Different
sets of acoustic patterns with HMM model configurations represented
by different points properly distributed over this two-dimensional
space are complementary to one another, and thus jointly capture the
characteristics of the corpora considered. Such a multi-level
framework was shown to be very helpful in the task of unsupervised
spoken term detection (STD) with spoken queries, because token
matching can be performed with pattern indices on different levels of
signal characteristics, and the information integration across
multiple model granularities offered improved performance.

In this work, we further propose an enhanced version of the
multi-level acoustic patterns with varying model granularity by
considering the context consistency for the decoded pattern sequences
within each level and across different levels. In other words, the
acoustic patterns discovered on different levels are no longer trained
completely independently. We try to “relabel” the pattern sequence
for each utterance in the training corpora considering the context
consistency within and across levels. For a certain level, the context
consistency may indicate that the realizations of a certain pattern
should be split into two different patterns, while the realizations of
another two patterns should be merged. In this way the multi-level
acoustic patterns can be enhanced.

2. PROPOSED APPROACH 

2.1. Pattern Discovery for a Given Model Configuration 

Given an unlabeled speech corpus, it is not difficult to discover the
desired acoustic patterns from the corpus in an unsupervised way for
a chosen hyperparameter set $\psi$ that determines the HMM
configuration (number of states per model and number of distinct
models). This can be achieved by first finding an initial label
$\omega_0$ based on a set of assumed patterns for all observations in
the corpus $X$ as in (1). Then in each iteration $t$ the HMM parameter
set $\theta_t^{\psi}$ can be trained with the label $\omega_{t-1}$
obtained in the previous iteration as in (2), and the new label
$\omega_t$ can be obtained by pattern decoding with the obtained
parameter set $\theta_t^{\psi}$ as in (3).


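$$\omega_0 = \mathrm{initialization}(X), \tag{1}$$

$$\theta_t^{\psi} = \arg\max_{\theta^{\psi}} P(X \mid \theta^{\psi}, \omega_{t-1}), \tag{2}$$

$$\omega_t = \arg\max_{\omega} P(X \mid \theta_t^{\psi}, \omega). \tag{3}$$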

The training process can be repeated for enough iterations until a
converged set of pattern HMMs is obtained.


Fig. 1: Model granularity space for acoustic pattern configurations.
Axis label: phonetic granularity (n), the number of acoustic pattern
HMMs.

2.2. Model Granularity Space for Multi-level Pattern Sets 

The above process can be performed with many different HMM
configurations, each characterized by two hyperparameters: the number
of states m in each acoustic pattern HMM, and the total number of
distinct acoustic patterns n during initialization, $\psi = (m, n)$.
The transcription of a signal decoded with these patterns can be
considered as a temporal segmentation of the signal, so the HMM length
(or number of states in each HMM) m represents the temporal
granularity. The set of all distinct acoustic patterns can be
considered as a segmentation of the phonetic space, so the total
number n of distinct acoustic patterns represents the phonetic
granularity. This gives a two-dimensional representation of the
acoustic pattern configurations in terms of temporal and phonetic
granularities as in Fig. 1. Any point in this two-dimensional space in
Fig. 1 corresponds to an acoustic pattern configuration. Note that in
our previous work [13], the effect of the third dimension, the
acoustic granularity, which is the number of Gaussians in each state,
was shown to be negligible, thus here we simply set the number of
Gaussians in each state to 4 in all cases. Although the selection of
the hyperparameters can be arbitrary in this two-dimensional space,
here we only select M temporal granularities and N phonetic
granularities, forming a two-dimensional array of M × N
hyperparameter sets in the granularity space.
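
As a minimal sketch of how such an M × N array of hyperparameter sets
could be enumerated, the Python fragment below builds the grid and
looks up the neighboring granularities of a given value. The concrete
values are illustrative (chosen to mirror the TIMIT setup reported in
the experiments), and the names `M_VALUES`, `N_VALUES`, `grid` and
`neighbors` are assumptions of this sketch, not part of the original
method description.

```python
from itertools import product

# Illustrative granularity values; assumptions for this sketch only.
M_VALUES = [3, 5, 7, 9, 11]      # temporal granularities (states per HMM)
N_VALUES = [50, 100, 200, 300]   # phonetic granularities (number of patterns)

# Every point (m, n) of the grid is one hyperparameter set psi = (m, n).
grid = list(product(M_VALUES, N_VALUES))


def neighbors(values, v):
    """Return (previous, next) entries around v in a sorted list.

    None is returned on a side where v sits on the boundary of the grid.
    """
    k = values.index(v)
    prev_v = values[k - 1] if k > 0 else None
    next_v = values[k + 1] if k + 1 < len(values) else None
    return prev_v, next_v


print(len(grid))                   # number of pattern sets in the grid
print(neighbors(M_VALUES, 3))      # boundary case on the temporal axis
print(neighbors(N_VALUES, 100))    # interior case on the phonetic axis
```

The neighbor lookup is the only structure the grid needs for the
context across granularities: each pattern set is compared only with
the sets one step away along m or n.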


2.3. Pattern Relabeling Considering Context Consistency 


Context constraints successfully explored in language modeling can
be used here for relabeling the acoustic patterns, as shown by an
example in Fig. 2. We assume the patterns ‘b’ and ‘B’ are similar
without context as in Fig. 2(a). However, if the context is considered,
we may observe from the corpus that many realizations of pattern ‘b’
are preceded by pattern ‘a’ and followed by pattern ‘c’, while most
realizations of pattern ‘B’ have different context. Therefore, by
relabeling all realizations of pattern ‘B’ which are preceded by
pattern ‘a’ and followed by pattern ‘c’ as pattern ‘b’, the contrast
between patterns ‘b’ and ‘B’ can be enhanced during the next iteration
of acoustic model update as shown in Fig. 2(b), since the borderline
cases have been resolved. As shown in Fig. 2(c), this relabeling
includes both pattern splitting and merging, since the realizations of
pattern ‘B’ are split into two patterns ‘B’ and ‘b’, while some
realizations of pattern ‘B’ are merged into pattern ‘b’. The example
here considers the context in time, but can be generalized to context
in model granularities as explained below.

Fig. 2: Pattern relabeling considering context consistency

Fig. 3: Local smoothing considering granularity context

As shown in Fig. 3, assume an utterance is decoded into four
different pattern sequences using four sets of patterns with
neighboring temporal granularities $m_4 > m_3 > m_2 > m_1$, i.e.,
pattern HMMs with different lengths. Considering a realization of
pattern ‘b’ of temporal granularity $m_3$, we find its central frame
belongs to the realization of pattern ‘a’ of temporal granularity
$m_4$ and the realization of pattern ‘c’ of temporal granularity
$m_2$. So patterns ‘a’ and ‘c’ are taken as the context of pattern ‘b’
in neighboring temporal granularities. The same can be done for
phonetic granularity.

2.4. Pattern Relabeling Method 

Let $\omega(m_k, n_k, l)$ be the index for a decoded acoustic pattern
at time $l$ within an utterance in the corpus $X$ using the acoustic
pattern set with the granularity $\psi(m_k, n_k)$. The relabeled
pattern $\hat{\omega}(m_k, n_k, l)$ is then as in (4a), i.e., the
pattern among all patterns in the set of $\psi(m_k, n_k)$ which
maximizes the product of the three probabilities in (4b)(4c)(4d)
evaluated with the context respectively in $l$, $n$, and $m$. The
first probability $P_l(w)$ in (4b) for context in time $l$ is actually
the product of the forward bigram and the backward bigram well known
in language modeling. The other two probabilities $P_n(w)$, $P_m(w)$
in (4c)(4d) are exactly the same, except that $n_{k-1}$, $n_{k+1}$ and
$m_{k-1}$, $m_{k+1}$ are the neighboring values of $n$ and $m$.

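$$\hat{\omega}(m_k, n_k, l) = \arg\max_{w} \big( P_l(w) P_n(w) P_m(w) \big), \tag{4a}$$

$$P_l(w) = P(w \mid \omega(m_k, n_k, l-1)) \, P(w \mid \omega(m_k, n_k, l+1)), \tag{4b}$$

$$P_n(w) = P(w \mid \omega(m_k, n_{k-1}, l)) \, P(w \mid \omega(m_k, n_{k+1}, l)), \tag{4c}$$

$$P_m(w) = P(w \mid \omega(m_{k-1}, n_k, l)) \, P(w \mid \omega(m_{k+1}, n_k, l)). \tag{4d}$$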

Finer patterns and coarser patterns are drastically different in terms
of perplexity; shorter patterns and longer patterns produce very
different pattern sequences in terms of duration. They are
complementary to each other, but we only consider the context
consistency among the neighboring granularity configurations as in
(4). This relabeling is performed on every decoded sequence of the
M × N pattern sets considered. Katz smoothing [21] was applied to deal
with unseen pattern bigrams. On the boundary of the granularity
configurations or time sequences, the bigram probability is taken as 1.
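
The relabeling rule in (4a)-(4d) can be sketched as below, under
simplifying assumptions: a single hypothetical function `cond_prob(w, c)`
stands in for all six smoothed bigram models (in practice each axis and
direction would have its own table, since the neighboring pattern sets
use different label inventories), and the toy probabilities are made
up. A missing neighbor is passed as `None` and contributes a factor of
1, following the boundary rule stated above.

```python
from typing import Callable, Hashable, Iterable, Optional, Sequence

Label = Hashable


def relabel_one(
    candidates: Iterable[Label],
    contexts: Sequence[Optional[Label]],
    cond_prob: Callable[[Label, Label], float],
) -> Label:
    """Pick the candidate w maximizing the product of P(w | c) over contexts.

    `contexts` holds the six neighboring labels of (4b)-(4d): previous and
    next in time l, in phonetic granularity n, and in temporal granularity m.
    A None entry (boundary) contributes a factor of 1.
    """
    best_label, best_score = None, -1.0
    for w in candidates:
        score = 1.0
        for c in contexts:
            if c is not None:
                score *= cond_prob(w, c)
        if score > best_score:
            best_label, best_score = w, score
    return best_label


# Toy conditional probabilities P(w | context); illustrative numbers only.
toy = {("b", "a"): 0.6, ("B", "a"): 0.2, ("b", "c"): 0.5, ("B", "c"): 0.1}


def toy_prob(w: Label, c: Label) -> float:
    # The 0.05 floor is a crude stand-in for smoothing of unseen bigrams.
    return toy.get((w, c), 0.05)


# A realization preceded by 'a' and followed by 'c' in time, with no
# granularity neighbors available (all four passed as None).
print(relabel_one(["b", "B"], ["a", "c", None, None, None, None], toy_prob))
```

With these toy numbers the candidate ‘b’ scores 0.6 × 0.5 = 0.3 against
0.2 × 0.1 = 0.02 for ‘B’, so the realization is relabeled as ‘b’, which
mirrors the time-context example of Section 2.3.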

2.5. Pattern Enhancement by Re-estimation after Relabeling 

The relabeling in (4) can be inserted into the recursive process of
discovering the patterns in each iteration in (2)(3), as shown in
(5)(6).

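$$\hat{\omega}_t = \arg\max_{\omega} P(\omega \mid \omega_t), \tag{5}$$

$$\theta_{t+1}^{\psi} = \arg\max_{\theta^{\psi}} P(X \mid \theta^{\psi}, \hat{\omega}_t). \tag{6}$$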

When an iteration of (2) and (3) is completed, a new set of patterns
is generated as in (2), with which a new set of labels is obtained as
in (3). The new labels $\omega_t$ in (3) are then relabeled with (4)
based on the new labels $\omega_t$ on all different HMM sets to
produce a slightly better label $\hat{\omega}_t$ as in (5). This
slightly better label $\hat{\omega}_t$ is then used in (6) to generate
a slightly better model set $\theta_{t+1}^{\psi}$. Note that (6) is
almost the same as (2), except that here it is based on the slightly
better label $\hat{\omega}_t$ obtained in (5). In this way the
relabeling process can be repeatedly applied in every iteration, and
the patterns can be enhanced by the relabeling process during the
model re-estimation. Although it is theoretically possible to consider
the optimization processes in (5) and (6) jointly in a single step,
such as maximizing the product of the two probabilities on the right
hand sides of (5) and (6), in practice such a joint optimization is
computationally infeasible. Therefore this is done in two separate
steps here.

2.6. Spoken Term Detection 

There can be various applications for the acoustic patterns presented
here. In this section we summarize the way to perform spoken term
detection. Let $\{p_r, r = 1, 2, 3, \ldots, n\}$ denote the n acoustic
patterns in the set of $\psi = (m, n)$. We first construct a
similarity matrix S of size n × n off-line for every pattern set
$\psi = (m, n)$, for which the element $S(i, j)$ is the similarity
between any two pattern HMMs $p_i$ and $p_j$ in the set,

$$S(i, j) = \exp(-KL(i, j) / \beta). \tag{7}$$

The KL-divergence $KL(i, j)$ between two pattern HMMs in (7) is
defined as the symmetric KL-divergence between the states based on
the variational approximation [22] summed over the states. To
transform the KL divergence into a similarity measure between 0 and 1,
a negative exponential was applied with a scaling factor $\beta$. When
$\beta$ is small, the similarity between distinct patterns in (7)
approaches zero, so $S$ approaches the delta function
$\delta(i, j)$. $\beta$ can be determined with a held out data set,
but here we simply set it to 100.
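
To illustrate how the scaling factor in (7) shapes the similarity
matrix, the following sketch maps a small divergence matrix to
similarities. The divergence values are made up, and the function name
`similarity_matrix` is an assumption of this sketch; computing the
HMM-level symmetric KL-divergence itself is not shown.

```python
import math


def similarity_matrix(kl, beta=100.0):
    """Map a symmetric KL-divergence matrix to similarities in (0, 1], as in (7)."""
    return [[math.exp(-value / beta) for value in row] for row in kl]


# Toy divergences between three pattern HMMs (zero on the diagonal).
kl_toy = [
    [0.0, 50.0, 400.0],
    [50.0, 0.0, 250.0],
    [400.0, 250.0, 0.0],
]

S = similarity_matrix(kl_toy)                  # beta = 100, as in the text
S_sharp = similarity_matrix(kl_toy, beta=1.0)  # small beta: near-delta matrix

for row in S:
    print([round(value, 3) for value in row])
```

With a small scaling factor the off-diagonal entries collapse toward
zero, so matching degenerates to exact index matching; a larger value
lets acoustically close patterns contribute partial credit.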

In the on-line phase, we perform the following for each entered
spoken query q and each document (utterance) d in the archive for
each pattern set $\psi = (m, n)$. Assume that for a given pattern set
a document d is decoded into a sequence of D acoustic patterns with
indices $(d_1, d_2, \ldots, d_D)$ and the query q into a sequence of Q
patterns with indices $(q_1, \ldots, q_Q)$. We thus construct a
matching matrix W of size D × Q for every document-query pair, in
which each entry $(i, j)$ is the similarity between acoustic patterns
with indices $d_i$ and $q_j$ as in (8) and shown in Fig. 4(a) for a
simple example of Q = 3 and D = 6, where $S(i, j)$ is defined in (7),

$$W(i, j) = S(d_i, q_j). \tag{8}$$

It is possible to consider the N-best pattern sequences rather than
the one-best sequences here by considering the posteriorgram vectors
based on the N-best sequences for d, q and integrating them in the
matrix W. However, previous experiments showed that the extra
improvements brought in this way are almost negligible, probably
because the M × N different pattern sequences based on the M × N
different pattern sets can be considered as a huge lattice including
many one-best paths which will be jointly considered here.
Fig. 5: Average Gini impurity for the top 20 words with the highest
counts in the TIMIT training set based on the original patterns (blue) 
and those after relabeling (green), with different values of n and m. 

For matching the sub-sequence of d with q, we sum the elements
in the matrix W in (8) along the diagonal direction, generating the
accumulated similarities for all sub-sequences starting at all pattern
positions in d as shown in Fig. 4(a). The maximum is selected to
represent the relevance between document d and query q on the pattern
set $\psi = (m, n)$ as in

$$R(d, q) = \max_{i} \sum_{j=1}^{Q} W(i + j, j). \tag{9}$$
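
A minimal sketch of (8) and (9) follows, assuming 0-based pattern
indices and a document at least as long as the query (the text does
not say how shorter documents are handled, so the sketch rejects
them). The function names and the toy similarity values are
assumptions of this sketch.

```python
def matching_matrix(doc, query, S):
    """W[i][j] = S[d_i][q_j] as in (8); doc and query are lists of pattern indices."""
    return [[S[d][q] for q in query] for d in doc]


def relevance(W):
    """Maximum diagonal sum over all start positions, as in (9), 0-based."""
    D, Q = len(W), len(W[0])
    if D < Q:
        # Not covered by the text; this sketch simply refuses the case.
        raise ValueError("document shorter than query")
    return max(sum(W[i + j][j] for j in range(Q)) for i in range(D - Q + 1))


# Toy similarity matrix for three patterns (illustrative values).
S_toy = [
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
doc = [2, 0, 1, 2, 1, 0]   # D = 6 decoded pattern indices
query = [0, 1, 2]          # Q = 3 decoded pattern indices

W = matching_matrix(doc, query, S_toy)
print(relevance(W))
```

In this toy case the query occurs exactly at document positions 1 to 3,
so the best diagonal accumulates three similarities of 1.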

It is also possible to consider dynamic time warping (DTW) on the
matrix W as shown in Fig. 4(b). However, previous experiments
showed that the extra improvements brought in this way are almost
negligible, probably because here we have jointly considered the
M × N different pattern sequences based on the M × N different
pattern sets (e.g. including longer/shorter patterns), so the
different time-warped matching and insertion/deletion between d and q
are already automatically included.

The M × N relevance scores $R(d, q)$ in (9) obtained with the M × N
pattern sets $\psi = (m, n)$ are then averaged and the average scores
are used in ranking all the documents for spoken term detection. It is
also possible to learn the weights for different pattern sets to
produce better results using a development set. But here we simply
assume the detection is completely unsupervised without any
annotation, and all pattern sets are equally weighted.
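
The equal-weight fusion described above can be sketched as follows.
The data layout (one dictionary of relevance scores per pattern set)
and the name `rank_documents` are assumptions made for illustration,
and the scores are toy values.

```python
def rank_documents(scores_per_set):
    """Average R(d, q) over pattern sets and rank documents by the average.

    scores_per_set: list of dicts mapping a document id to its relevance
    score on one pattern set; every dict is assumed to cover the same ids.
    """
    doc_ids = list(scores_per_set[0].keys())
    num_sets = len(scores_per_set)
    average = {d: sum(s[d] for s in scores_per_set) / num_sets for d in doc_ids}
    ranking = sorted(doc_ids, key=lambda d: average[d], reverse=True)
    return ranking, average


# Two pattern sets that disagree on the best document (toy scores).
toy_scores = [
    {"d1": 3.0, "d2": 1.0},
    {"d1": 1.0, "d2": 2.0},
]
ranking, average = rank_documents(toy_scores)
print(ranking, average)
```

Learned weights, mentioned in the text as an option, would replace the
plain mean with a weighted sum; that variant is not shown here.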

3. EXPERIMENTS 

3.1. Purity in Pattern Sequences for Known Words

Fig. 4: The matching matrix W

Search methods (MAP)               TIMIT     Mandarin
(a) frame-based DTW on MFCC        10.16%    22.19%
(b) proposed: original patterns    26.32%    23.38%
(c) proposed: relabeled patterns   28.26%    24.50%

Table 1: Overall spoken term detection performance in mean average
precision.

Fig. 6: Average Gini impurity for the cluster of words with
occurrence counts ranging from 16 to 22 with (a) m=3, (b) m=11.

In order to evaluate the quality of the acoustic patterns we discovered
with varying temporal and phonetic granularities, we use the Gini
impurity for the pattern sequences found for known high frequency
words, since this can be evaluated for any given pattern set. Assume
all the realizations of a high frequency word (e.g. the word “water”)
are decoded into $I$ different pattern sequences, each occupying a
percentage $f_i$ of the realizations ($\sum_i f_i = 1$). We can then
evaluate the Gini impurity for the word using the $I$ percentages
$f = \{f_i, i = 1, 2, \ldots, I\}$ as in

$$\text{Gini Impurity}(f) = \sum_{i=1}^{I} f_i (1 - f_i). \tag{10}$$

Gini impurity falls within the interval [0,1), reaches zero when all
the realizations are decoded into the same pattern sequence, and
becomes larger when the distribution is less pure. We trained the
above different sets of patterns with m=3, 5, 7, 9, 11 and n=50, 100,
200, 300 on the TIMIT training set. Fig. 5 shows the average Gini
impurity for the top 20 words with the highest occurrence counts in
the TIMIT training set, based on the original patterns (blue) and
those after relabeling (green) for all cases considered. We see the
impurity was in general high for such automatically discovered
patterns because the realizations of the same phoneme produced by
different speakers were possibly decoded as different patterns, and
the insertion/deletion inevitably increased the impurity. Although the
impurity was high, the relabeling proposed here generated better
patterns. We see the difference was more significant for larger m.
Because the temporal variation is easily captured by models with short
patterns (m=3 or 5), which increases the impurity, much lower impurity
was achieved with longer patterns (m=9 or 11).
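
The Gini impurity in (10) is straightforward to compute from the
decoded realizations of one word. The sketch below assumes each
realization is given as a sequence of pattern labels; the function
name and the toy sequences are illustrative only.

```python
from collections import Counter


def gini_impurity(sequences):
    """Gini impurity (10) over the distinct pattern sequences of one word."""
    counts = Counter(tuple(seq) for seq in sequences)
    total = sum(counts.values())
    return sum((c / total) * (1 - c / total) for c in counts.values())


# Four realizations decoded identically: perfectly pure.
pure = [["p3", "p7"]] * 4
# Four realizations split evenly over two distinct sequences.
mixed = [["p3", "p7"], ["p3", "p7"], ["p3", "p9"], ["p3", "p9"]]

print(gini_impurity(pure), gini_impurity(mixed))
```

Note that two realizations count as the same only if the whole
sequence matches, so a single inserted or deleted pattern already
creates a new category; this is consistent with the generally high
impurity values discussed above.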

Another set of results for average Gini impurity for the cluster of
words with occurrence counts ranging from 16 to 22 in the TIMIT
training set is shown in Fig. 6 for m=3 and 11 states per HMM with
varying number of distinct patterns (n). It is still quite clear that
the relabeling process enhanced the patterns, and it is interesting to
note that the trends for m=3 and 11 are quite different (Fig. 6(a) and
(b)). As mentioned above, the temporal variation is easily captured
by models with short patterns, which increases the impurity (e.g. m=3
in Fig. 6(a)), so increasing the number of patterns (n) helped reduce
the impurity. However, when the models are long enough (e.g. m=11
in Fig. 6(b)), a larger number of patterns (n) gives more redundant
patterns which cause confusion during decoding, so the impurity went
up with larger n. These results indicate that the different sets of
patterns of different model granularities were complementary to each
other. Note that only high frequency words with enough realizations
can be used for the impurity evaluation here to show the quality of
the patterns. How these patterns can be applied to spoken term
detection will be shown below, for which the queries are usually low
frequency words, whose impurity is difficult to evaluate.

3.2. Unsupervised Spoken Term Detection 

We conducted two separate query-by-example spoken term detection
experiments on two spoken archives. In the first experiment, the
TIMIT training set was used as the spoken archive and the spoken
query set consisted of 16 words randomly selected from the TIMIT
testing set. In the second experiment, the spoken archive was 4.5
hours of Mandarin Broadcast News segmented into 5034 spoken documents
and the spoken query set was 10 words selected from another
development set. In either case, a spoken instance of a query word was
randomly selected from the data set, and used as the spoken query to
search for other instances in the spoken archive. The conventional
39-dimensional MFCC features were used for the HMMs. 20 sets of
acoustic patterns were generated for TIMIT with m = 3, 5, 7, 9, 11 and
n = 50, 100, 200, 300; 9 sets for the Mandarin Broadcast News with
m = 3, 7, 13 and n = 50, 100, 300; all with 4 Gaussian mixtures per
state. We compared the relabeled patterns $\hat{\omega}(m, n)$ with
the original patterns $\omega(m, n)$ for each (m, n) pair. We used the
mean average precision (MAP) [25] as the performance measure; a
higher value implies better performance.
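
For reference, mean average precision can be computed as in the sketch
below, which uses the common textbook definition (the paper does not
spell out its exact variant, so this is an assumption). Each query is
represented by its ranked list with a boolean flag per retrieved
document; the function names and toy rankings are illustrative.

```python
def average_precision(ranked_hits, num_relevant=None):
    """Average precision of one ranked list.

    ranked_hits: booleans in ranked order, True marking a relevant document.
    num_relevant: total relevant documents; defaults to the hits in the list.
    """
    hits, precision_sum = 0, 0.0
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            hits += 1
            precision_sum += hits / rank
    total = num_relevant if num_relevant is not None else hits
    return precision_sum / total if total else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over all queries."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Two toy queries: hits at ranks 1 and 3, and a single hit at rank 2.
toy_runs = [[True, False, True], [False, True]]
print(mean_average_precision(toy_runs))
```

For the first toy query the precision at the two hits is 1/1 and 2/3,
giving an average precision of about 0.833; the second query gives 0.5.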

The MAP performance of each of the 20 pattern sets for TIMIT
and 9 sets for Mandarin Broadcast News before and after relabeling
is shown in Fig. 7(a)(b), where the performance was clearly boosted
for most of the pattern sets. A paired-sample t-test was used to check
the MAP improvement of the relabeled pattern sets, and a significant
improvement was observed (t(28)=3.37, p=0.0011). Note that, different
from TIMIT which had many different speakers, the Mandarin Broadcast
News was produced by a limited number of anchors, so the MAP for each
pattern set ranged from 18% to 22%, much higher than on TIMIT.
Although the MAP for each individual pattern set was relatively low
on TIMIT (1% to 5%) in general, much better results in MAP can
be obtained when all of them are jointly considered, as in rows (b)(c)
of Table 1. Row (a) in Table 1 is the frame-based dynamic time
warping (DTW) on MFCC sequences. We see the relabeled patterns
achieved MAPs of 28.26% and 24.50% on TIMIT and Mandarin Broadcast
News respectively, which is significantly better than that using the
original patterns (26.32% and 23.38%). Furthermore, both of them
significantly outperformed the baseline (10.16% and 22.19%), which
indicates the improvement was non-trivial.


4. CONCLUSION 


In this work, we propose a method for improving the quality of
multi-level acoustic patterns discovered from a target corpus. By
incorporating context consistency in time and model granularity, a
more consistent set of patterns can be obtained. This is verified with
improved performance in spoken term detection on TIMIT and Mandarin
Broadcast News.



Fig. 7: Mean average precision of each of the HMM sets with the
granularity hyperparameters (m,n) on (a) TIMIT and (b) Mandarin
Broadcast News.